Exploration server on OCR

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Semi-supervised Bibliographic Element Segmentation with Latent Permutations

Internal identifier: 000440 (Main/Exploration); previous: 000439; next: 000441

Authors: Tomonari Masada [Japan]; Atsuhiro Takasu [Japan]; Yuichiro Shibata [Japan]; Kiyoshi Oguri [Japan]

RBID : ISTEX:FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC

Abstract

This paper proposes a semi-supervised method for bibliographic element segmentation. Our input data is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g., authors, title, journal, and pages. We solve this problem with an LDA-like topic model by assigning each word token to a topic so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy a contiguity constraint, i.e., the word tokens assigned to the same topic should be contiguous. To this end, we proposed a topic model in our preceding work [8], based on the topic model devised by Chen et al. [3]. Our model extends LDA and realizes unsupervised topic assignments that satisfy the contiguity constraint. The main contribution of this paper is a semi-supervised learning method for this model. We assume that at most one third of the word tokens are already labeled and that a few percent of the labels may be incorrect. The experiment showed that our semi-supervised learning improved on unsupervised learning by a large margin and achieved a segmentation accuracy of over 90%.
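
The contiguity constraint mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' model or inference procedure; it only checks whether a given topic assignment is contiguous and, if so, reads off the resulting segments. The function names `is_contiguous` and `segments`, the toy tokens, and the topic numbering (0 = authors, 1 = title, 2 = journal, 3 = pages) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def is_contiguous(topics: Sequence[int]) -> bool:
    """Return True if each topic occupies one unbroken run of positions."""
    seen = set()      # topics whose run has already started
    previous = None   # topic of the preceding token
    for topic in topics:
        if topic != previous:
            # A new run starts here; the topic must not have appeared before.
            if topic in seen:
                return False
            seen.add(topic)
            previous = topic
    return True


def segments(tokens: Sequence[str],
             topics: Sequence[int]) -> List[Tuple[int, List[str]]]:
    """Group tokens into (topic, tokens) runs in reading order."""
    if len(tokens) != len(topics):
        raise ValueError("tokens and topics must have the same length")
    runs: List[Tuple[int, List[str]]] = []
    for token, topic in zip(tokens, topics):
        if runs and runs[-1][0] == topic:
            runs[-1][1].append(token)
        else:
            runs.append((topic, [token]))
    return runs


# Hypothetical reference string, already split into word tokens.
tokens = ["A.", "Author", "Some", "Title", "Journal", "12", "34"]
good = [0, 0, 1, 1, 2, 3, 3]   # each topic forms a single run
bad = [0, 1, 0, 1, 2, 3, 3]    # topics 0 and 1 are interleaved

print(is_contiguous(good))     # contiguous assignment
print(is_contiguous(bad))      # violates the constraint
print(segments(tokens, good))
```

When the assignment is contiguous, each run returned by `segments` corresponds to one candidate bibliographic element, which is why the constraint matters for segmentation.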

DOI: 10.1007/978-3-642-24826-9_11



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Semi-supervised Bibliographic Element Segmentation with Latent Permutations</title>
<author>
<name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
</author>
<author>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</author>
<author>
<name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
</author>
<author>
<name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC</idno>
<date when="2011" year="2011">2011</date>
<idno type="doi">10.1007/978-3-642-24826-9_11</idno>
<idno type="url">https://api.istex.fr/document/FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000B19</idno>
<idno type="wicri:Area/Istex/Curation">000B06</idno>
<idno type="wicri:Area/Istex/Checkpoint">000096</idno>
<idno type="wicri:doubleKey">0302-9743:2011:Masada T:semi:supervised:bibliographic</idno>
<idno type="wicri:Area/Main/Merge">000445</idno>
<idno type="wicri:Area/Main/Curation">000440</idno>
<idno type="wicri:Area/Main/Exploration">000440</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Semi-supervised Bibliographic Element Segmentation with Latent Permutations</title>
<author>
<name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki</wicri:regionArea>
<wicri:noRegion>Nagasaki</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
<affiliation wicri:level="3">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo</wicri:regionArea>
<placeName>
<settlement type="city">Tokyo</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki</wicri:regionArea>
<wicri:noRegion>Nagasaki</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki</wicri:regionArea>
<wicri:noRegion>Nagasaki</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2011</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC</idno>
<idno type="DOI">10.1007/978-3-642-24826-9_11</idno>
<idno type="ChapterID">11</idno>
<idno type="ChapterID">Chap11</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: This paper proposes a semi-supervised bibliographic element segmentation. Our input data is a large scale set of bibliographic references each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g. authors, title, journal, pages, etc. We solve this problem with an LDA-like topic model by assigning each word token to a topic so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy contiguity constraint, i.e., the constraint that the word tokens assigned to the same topic should be contiguous. Therefore, we proposed a topic model in our preceding work [8] based on the topic model devised by Chen et al. [3]. Our model extends LDA and realizes unsupervised topic assignments satisfying contiguity constraint. The main contribution of this paper is the proposal of a semi-supervised learning for our proposed model. We assume that at most one third of word tokens are already labeled. In addition, we assume that a few percent of the labels may be incorrect. The experiment showed that our semi-supervised learning improved the unsupervised learning by a large margin and achieved an over 90% segmentation accuracy.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
</country>
<settlement>
<li>Tokyo</li>
</settlement>
</list>
<tree>
<country name="Japon">
<noRegion>
<name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
</noRegion>
<name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
<name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
<name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
<name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
<name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000440 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000440 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC
   |texte=   Semi-supervised Bibliographic Element Segmentation with Latent Permutations
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024